{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Inference\n",
    "\n",
    "This notebook contains all the commands for model inference used in Fig. 2 of the following paper:\n",
    "\n",
    "```\n",
    "Hosseini, Nanni and Coll Ardanuy (2020), DeezyMatch: A Flexible Deep Learning Approach to Fuzzy String Matching, EMNLP: System Demonstrations.\n",
    "```\n",
    "\n",
    "The models are trained and fine-tuned in the `Fig2_EMNLP_train` notebook.\n",
    "\n",
    "---\n",
    "\n",
    "Models evaluated in this notebook:\n",
    "\n",
    "* skyline1: trained on the *OCR* dataset\n",
    "* skyline2: trained on the *WG:en+OCR* dataset\n",
    "* baseline: trained on the *WG:en* dataset\n",
    "\n",
    "---\n",
    "\n",
    "* model A: both the embedding layer and the recurrent units are frozen (i.e., their parameters are not updated during fine-tuning).\n",
    "* model B: only the embedding layer is frozen.\n",
    "\n",
    "---\n",
    "\n",
    "To show how fine-tuning and the choice of architecture affect model performance, we fine-tuned the baseline model with progressively more training instances from the *OCR* training set.\n",
    "\n",
    "These fine-tuned models are then evaluated on the *OCR* test set.\n",
    "\n",
    "Refer to the paper for more information."
   ]
  },
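  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between model A and model B can be illustrated with a minimal PyTorch sketch. The `TinyMatcher` class, its layer names (`emb`, `rnn`, `fc`) and the `freeze` and `trainable` helpers are hypothetical and are not DeezyMatch's implementation; the sketch only assumes that freezing a layer means setting `requires_grad = False` on its parameters.\n",
    "\n",
    "```python\n",
    "import torch.nn as nn\n",
    "\n",
    "\n",
    "class TinyMatcher(nn.Module):\n",
    "    # Hypothetical stand-in for a pair classifier: embedding -> GRU -> linear.\n",
    "    def __init__(self, vocab_size=50, emb_dim=8, hidden_dim=16):\n",
    "        super().__init__()\n",
    "        self.emb = nn.Embedding(vocab_size, emb_dim)\n",
    "        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)\n",
    "        self.fc = nn.Linear(hidden_dim, 2)\n",
    "\n",
    "\n",
    "def freeze(module):\n",
    "    # Exclude all parameters of this module from gradient updates.\n",
    "    for param in module.parameters():\n",
    "        param.requires_grad = False\n",
    "\n",
    "\n",
    "def trainable(model):\n",
    "    # Names of the parameters that would still be updated during fine-tuning.\n",
    "    return sorted(n for n, p in model.named_parameters() if p.requires_grad)\n",
    "\n",
    "\n",
    "# Model A: embedding layer and recurrent units frozen.\n",
    "model_a = TinyMatcher()\n",
    "freeze(model_a.emb)\n",
    "freeze(model_a.rnn)\n",
    "\n",
    "# Model B: only the embedding layer frozen.\n",
    "model_b = TinyMatcher()\n",
    "freeze(model_b.emb)\n",
    "\n",
    "print(trainable(model_a))\n",
    "print(trainable(model_b))\n",
    "```\n",
    "\n",
    "In this sketch, model A keeps only the final linear layer trainable, while model B also updates the recurrent units."
   ]
  },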
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## skyline1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:17:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:17:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r",
      "length s1:   0%|          | 0/8508 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:17:58\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:17:58\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 18:17:58\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:18:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:18:02 -- Epoch: 0/0; Test; loss: 0.137; acc: 0.956; precision: 0.949, recall: 0.963, macrof1: 0.956, weightedf1: 0.956\u001b[0m\n",
      "--- 6.686788320541382 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate skyline1 (trained on OCR) on the OCR test set.\n",
    "dm_inference(input_file_path=\"./inputs/input_dfm.yaml\",\n",
    "             dataset_path=\"./dataset/ocr_test.txt\",\n",
    "             pretrained_model_path=\"./models/ocr_001/ocr_001.model\",\n",
    "             pretrained_vocab_path=\"./models/ocr_001/ocr_001.vocab\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:19:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:19:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:19:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:19:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:19:03\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:19:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:19:45 -- Epoch: 0/0; Test; loss: 3.691; acc: 0.642; precision: 0.718, recall: 0.470, macrof1: 0.631, weightedf1: 0.631\u001b[0m\n",
      "--- 43.22682738304138 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate skyline1 (trained on OCR) on the WG:en test set.\n",
    "dm_inference(input_file_path=\"./inputs/input_dfm.yaml\",\n",
    "             dataset_path=\"./dataset/wikigaz_en_test.txt\",\n",
    "             pretrained_model_path=\"./models/ocr_001/ocr_001.model\",\n",
    "             pretrained_vocab_path=\"./models/ocr_001/ocr_001.vocab\")\n"
   ]
  },
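  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The test metrics printed by each inference cell can be collected for comparison across models. The following is a minimal sketch, assuming the log line has been copied as plain text without the terminal color codes; `parse_metrics` is a hypothetical helper and not part of DeezyMatch. The sample line is the skyline1 result on the *OCR* test set shown above.\n",
    "\n",
    "```python\n",
    "def parse_metrics(log_line):\n",
    "    # Keep only the text after the 'Test;' marker, which holds the metrics.\n",
    "    _, _, tail = log_line.partition('Test;')\n",
    "    metrics = {}\n",
    "    # Metrics are separated by ';' or ',' and written as 'name: value'.\n",
    "    for field in tail.replace(',', ';').split(';'):\n",
    "        name, sep, value = field.partition(':')\n",
    "        if sep:\n",
    "            metrics[name.strip()] = float(value)\n",
    "    return metrics\n",
    "\n",
    "\n",
    "line = ('09/10/2020_18:18:02 -- Epoch: 0/0; Test; loss: 0.137; acc: 0.956; '\n",
    "        'precision: 0.949, recall: 0.963, macrof1: 0.956, weightedf1: 0.956')\n",
    "print(parse_metrics(line))\n",
    "```\n",
    "\n",
    "Storing one such dictionary per model and test set makes it straightforward to tabulate or plot the results."
   ]
  },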
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## skyline1b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "s1 padding:   0%|          | 0/8508 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:17:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_b.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 22:17:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 22:17:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 22:17:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 22:17:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:17:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_22:17:07 -- Epoch: 0/0; Test; loss: 0.120; acc: 0.964; precision: 0.970, recall: 0.958, macrof1: 0.964, weightedf1: 0.964\u001b[0m\n",
      "--- 3.20565128326416 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate skyline1b (model ocr_001b, input file input_dfm_b.yaml) on the OCR test set.\n",
    "dm_inference(input_file_path=\"./inputs/input_dfm_b.yaml\",\n",
    "             dataset_path=\"./dataset/ocr_test.txt\",\n",
    "             pretrained_model_path=\"./models/ocr_001b/ocr_001b.model\",\n",
    "             pretrained_vocab_path=\"./models/ocr_001b/ocr_001b.vocab\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:17:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_b.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 22:17:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 22:17:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 22:17:08\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:17:09\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:17:57\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_22:17:57 -- Epoch: 0/0; Test; loss: 4.925; acc: 0.650; precision: 0.750, recall: 0.450, macrof1: 0.635, weightedf1: 0.635\u001b[0m\n",
      "--- 49.29198336601257 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate skyline1b (model ocr_001b, input file input_dfm_b.yaml) on the WG:en test set.\n",
    "dm_inference(input_file_path=\"./inputs/input_dfm_b.yaml\",\n",
    "             dataset_path=\"./dataset/wikigaz_en_test.txt\",\n",
    "             pretrained_model_path=\"./models/ocr_001b/ocr_001b.model\",\n",
    "             pretrained_vocab_path=\"./models/ocr_001b/ocr_001b.vocab\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## skyline2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:20:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:20:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:20:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:20:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:20:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:20:09\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:20:09 -- Epoch: 0/0; Test; loss: 0.285; acc: 0.881; precision: 0.860, recall: 0.910, macrof1: 0.881, weightedf1: 0.881\u001b[0m\n",
      "--- 4.605390548706055 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate skyline2 (trained on WG:en+OCR) on the OCR test set.\n",
    "dm_inference(input_file_path=\"./inputs/input_dfm.yaml\",\n",
    "             dataset_path=\"./dataset/ocr_test.txt\",\n",
    "             pretrained_model_path=\"./models/wikigaz_en_ocr_gru_001/wikigaz_en_ocr_gru_001.model\",\n",
    "             pretrained_vocab_path=\"./models/wikigaz_en_ocr_gru_001/wikigaz_en_ocr_gru_001.vocab\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:20:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:20:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:20:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:20:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:20:57\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:21:42\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:21:42 -- Epoch: 0/0; Test; loss: 0.178; acc: 0.925; precision: 0.915, recall: 0.938, macrof1: 0.925, weightedf1: 0.925\u001b[0m\n",
      "--- 46.3150749206543 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate skyline2 (trained on WG:en+OCR) on the WG:en test set.\n",
    "dm_inference(input_file_path=\"./inputs/input_dfm.yaml\",\n",
    "             dataset_path=\"./dataset/wikigaz_en_test.txt\",\n",
    "             pretrained_model_path=\"./models/wikigaz_en_ocr_gru_001/wikigaz_en_ocr_gru_001.model\",\n",
    "             pretrained_vocab_path=\"./models/wikigaz_en_ocr_gru_001/wikigaz_en_ocr_gru_001.vocab\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## skyline2b"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                   "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 23:01:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_b.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 23:01:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 23:01:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 23:01:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 23:01:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 23:01:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_23:01:23 -- Epoch: 0/0; Test; loss: 0.256; acc: 0.895; precision: 0.871, recall: 0.926, macrof1: 0.895, weightedf1: 0.895\u001b[0m\n",
      "--- 3.2803962230682373 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate skyline2b (model wikigaz_en_ocr_gru_001b, input file input_dfm_b.yaml) on the OCR test set.\n",
    "dm_inference(input_file_path=\"./inputs/input_dfm_b.yaml\",\n",
    "             dataset_path=\"./dataset/ocr_test.txt\",\n",
    "             pretrained_model_path=\"./models/wikigaz_en_ocr_gru_001b/wikigaz_en_ocr_gru_001b.model\",\n",
    "             pretrained_vocab_path=\"./models/wikigaz_en_ocr_gru_001b/wikigaz_en_ocr_gru_001b.vocab\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 23:01:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_b.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 23:01:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 23:01:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 23:01:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 23:01:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 23:01:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_23:01:53 -- Epoch: 0/0; Test; loss: 0.176; acc: 0.926; precision: 0.908, recall: 0.949, macrof1: 0.926, weightedf1: 0.926\u001b[0m\n",
      "--- 29.86833095550537 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate skyline2b (model wikigaz_en_ocr_gru_001b, input file input_dfm_b.yaml) on the WG:en test set.\n",
    "dm_inference(input_file_path=\"./inputs/input_dfm_b.yaml\",\n",
    "             dataset_path=\"./dataset/wikigaz_en_test.txt\",\n",
    "             pretrained_model_path=\"./models/wikigaz_en_ocr_gru_001b/wikigaz_en_ocr_gru_001b.model\",\n",
    "             pretrained_vocab_path=\"./models/wikigaz_en_ocr_gru_001b/wikigaz_en_ocr_gru_001b.vocab\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## baseline1_gru"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/8508 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:22:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:22:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:22:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:22:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 18:22:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:22:49\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:22:49 -- Epoch: 0/0; Test; loss: 1.741; acc: 0.455; precision: 0.451, recall: 0.413, macrof1: 0.454, weightedf1: 0.454\u001b[0m\n",
      "--- 5.1771557331085205 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate the GRU baseline (trained on WG:en) on the OCR test set.\n",
    "dm_inference(input_file_path=\"./inputs/input_dfm.yaml\",\n",
    "             dataset_path=\"./dataset/ocr_test.txt\",\n",
    "             pretrained_model_path=\"./models/wikigaz_en_gru_001/wikigaz_en_gru_001.model\",\n",
    "             pretrained_vocab_path=\"./models/wikigaz_en_gru_001/wikigaz_en_gru_001.vocab\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:23:09\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:23:09\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:23:09\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:23:09\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:23:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:23:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:23:56 -- Epoch: 0/0; Test; loss: 0.157; acc: 0.938; precision: 0.937, recall: 0.939, macrof1: 0.938, weightedf1: 0.938\u001b[0m\n",
      "--- 46.95348644256592 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate the GRU baseline (trained on WG:en) on the WG:en test set.\n",
    "dm_inference(input_file_path=\"./inputs/input_dfm.yaml\",\n",
    "             dataset_path=\"./dataset/wikigaz_en_test.txt\",\n",
    "             pretrained_model_path=\"./models/wikigaz_en_gru_001/wikigaz_en_gru_001.model\",\n",
    "             pretrained_vocab_path=\"./models/wikigaz_en_gru_001/wikigaz_en_gru_001.vocab\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## baseline1_lstm"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:25:00\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:25:00\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:25:00\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:25:00\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:25:00\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:25:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:25:05 -- Epoch: 0/0; Test; loss: 1.824; acc: 0.452; precision: 0.454, recall: 0.473, macrof1: 0.452, weightedf1: 0.452\u001b[0m\n",
      "--- 5.7508580684661865 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate the LSTM baseline (trained on WG:en) on the OCR test set.\n",
    "dm_inference(input_file_path=\"./inputs/input_dfm_lstm.yaml\",\n",
    "             dataset_path=\"./dataset/ocr_test.txt\",\n",
    "             pretrained_model_path=\"./models/wikigaz_en_lstm_001/wikigaz_en_lstm_001.model\",\n",
    "             pretrained_vocab_path=\"./models/wikigaz_en_lstm_001/wikigaz_en_lstm_001.vocab\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:25:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:25:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:25:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:25:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:25:46\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:26:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:26:35 -- Epoch: 0/0; Test; loss: 0.158; acc: 0.937; precision: 0.936, recall: 0.937, macrof1: 0.937, weightedf1: 0.937\u001b[0m\n",
      "--- 50.00007247924805 seconds ---\n"
     ]
    }
   ],
   "source": [
    "# Evaluate the LSTM model trained on WG:en (wikigaz_en_lstm_001) on the WG:en test set\n",
    "dm_inference(input_file_path=\"./inputs/input_dfm_lstm.yaml\",\n",
    "             dataset_path=\"./dataset/wikigaz_en_test.txt\",\n",
    "             pretrained_model_path=\"./models/wikigaz_en_lstm_001/wikigaz_en_lstm_001.model\",\n",
    "             pretrained_vocab_path=\"./models/wikigaz_en_lstm_001/wikigaz_en_lstm_001.vocab\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## baseline1_rnn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/8508 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:27:12\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:27:12\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:27:12\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:27:12\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 18:27:12\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:27:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:27:17 -- Epoch: 0/0; Test; loss: 1.243; acc: 0.484; precision: 0.484, recall: 0.505, macrof1: 0.483, weightedf1: 0.483\u001b[0m\n",
      "--- 4.953282833099365 seconds ---\n"
     ]
    }
   ],
   "source": [
    "# Evaluate the RNN model trained on WG:en (wikigaz_en_rnn_001) on the OCR test set\n",
    "dm_inference(input_file_path=\"./inputs/input_dfm_rnn.yaml\",\n",
    "             dataset_path=\"./dataset/ocr_test.txt\",\n",
    "             pretrained_model_path=\"./models/wikigaz_en_rnn_001/wikigaz_en_rnn_001.model\",\n",
    "             pretrained_vocab_path=\"./models/wikigaz_en_rnn_001/wikigaz_en_rnn_001.vocab\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:27:42\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:27:42\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:27:42\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:27:42\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:27:43\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:28:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:28:28 -- Epoch: 0/0; Test; loss: 0.219; acc: 0.909; precision: 0.920, recall: 0.897, macrof1: 0.909, weightedf1: 0.909\u001b[0m\n",
      "--- 46.63703799247742 seconds ---\n"
     ]
    }
   ],
   "source": [
    "# Evaluate the RNN model trained on WG:en (wikigaz_en_rnn_001) on the WG:en test set\n",
    "dm_inference(input_file_path=\"./inputs/input_dfm_rnn.yaml\",\n",
    "             dataset_path=\"./dataset/wikigaz_en_test.txt\",\n",
    "             pretrained_model_path=\"./models/wikigaz_en_rnn_001/wikigaz_en_rnn_001.model\",\n",
    "             pretrained_vocab_path=\"./models/wikigaz_en_rnn_001/wikigaz_en_rnn_001.vocab\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fine-Tuned, model A, GRU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------- 250\n",
      "\u001b[92m2020-09-10 18:31:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:06\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:06\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:06\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:31:10 -- Epoch: 0/0; Test; loss: 0.905; acc: 0.602; precision: 0.590, recall: 0.669, macrof1: 0.600, weightedf1: 0.600\u001b[0m\n",
      "--- 4.97077751159668 seconds ---\n",
      "--------- 500\n",
      "\u001b[92m2020-09-10 18:31:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:15\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:31:15 -- Epoch: 0/0; Test; loss: 0.715; acc: 0.677; precision: 0.659, recall: 0.732, macrof1: 0.676, weightedf1: 0.676\u001b[0m\n",
      "--- 4.442528486251831 seconds ---\n",
      "--------- 1000\n",
      "\u001b[92m2020-09-10 18:31:15\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:15\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:15\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:15\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:15\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:19\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:31:19 -- Epoch: 0/0; Test; loss: 0.585; acc: 0.743; precision: 0.745, recall: 0.740, macrof1: 0.743, weightedf1: 0.743\u001b[0m\n",
      "--- 4.261778354644775 seconds ---\n",
      "--------- 2000\n",
      "\u001b[92m2020-09-10 18:31:19\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:19\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:19\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:19\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:19\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:31:23 -- Epoch: 0/0; Test; loss: 0.495; acc: 0.787; precision: 0.774, recall: 0.811, macrof1: 0.787, weightedf1: 0.787\u001b[0m\n",
      "--- 4.284550905227661 seconds ---\n",
      "--------- 4000\n",
      "\u001b[92m2020-09-10 18:31:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:31:28 -- Epoch: 0/0; Test; loss: 0.423; acc: 0.824; precision: 0.818, recall: 0.834, macrof1: 0.824, weightedf1: 0.824\u001b[0m\n",
      "--- 4.089308261871338 seconds ---\n",
      "--------- 8000\n",
      "\u001b[92m2020-09-10 18:31:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:31:32 -- Epoch: 0/0; Test; loss: 0.362; acc: 0.851; precision: 0.839, recall: 0.869, macrof1: 0.851, weightedf1: 0.851\u001b[0m\n",
      "--- 4.261324405670166 seconds ---\n",
      "--------- 16000\n",
      "\u001b[92m2020-09-10 18:31:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:36\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:31:36 -- Epoch: 0/0; Test; loss: 0.306; acc: 0.878; precision: 0.861, recall: 0.902, macrof1: 0.878, weightedf1: 0.878\u001b[0m\n",
      "--- 4.272459030151367 seconds ---\n",
      "--------- 32000\n",
      "\u001b[92m2020-09-10 18:31:36\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:36\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:36\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:36\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:36\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:31:40 -- Epoch: 0/0; Test; loss: 0.265; acc: 0.896; precision: 0.898, recall: 0.894, macrof1: 0.896, weightedf1: 0.896\u001b[0m\n",
      "--- 4.092085123062134 seconds ---\n",
      "--------- 64000\n",
      "\u001b[92m2020-09-10 18:31:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:31:44 -- Epoch: 0/0; Test; loss: 0.230; acc: 0.915; precision: 0.909, recall: 0.922, macrof1: 0.915, weightedf1: 0.915\u001b[0m\n",
      "--- 4.144334077835083 seconds ---\n",
      "--------- 84000\n",
      "\u001b[92m2020-09-10 18:31:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:31:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:31:49\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:31:49 -- Epoch: 0/0; Test; loss: 0.211; acc: 0.924; precision: 0.914, recall: 0.936, macrof1: 0.924, weightedf1: 0.924\u001b[0m\n",
      "--- 4.116542816162109 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate each model A (GRU) checkpoint, fine-tuned on n_ft_examples OCR training instances, on the OCR test set\n",
    "for n_ft_examples in [250, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 84000]:\n",
    "    print(\"---------\", n_ft_examples)\n",
    "    # each fine-tuned model is stored in a directory that shares its name\n",
    "    model_name = f\"wikigaz_en_ft_ocr_gru_v001_n{n_ft_examples}\"\n",
    "    dm_inference(input_file_path=\"./inputs/input_dfm_gru_model_A.yaml\",\n",
    "                 dataset_path=\"./dataset/ocr_test.txt\",\n",
    "                 pretrained_model_path=f\"./models/{model_name}/{model_name}.model\",\n",
    "                 pretrained_vocab_path=f\"./models/{model_name}/{model_name}.vocab\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------- 250\n",
      "\u001b[92m2020-09-10 18:33:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:33:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:33:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:33:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:33:33\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:34:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:34:14 -- Epoch: 0/0; Test; loss: 0.241; acc: 0.905; precision: 0.889, recall: 0.927, macrof1: 0.905, weightedf1: 0.905\u001b[0m\n",
      "--- 41.99560475349426 seconds ---\n",
      "--------- 500\n",
      "\u001b[92m2020-09-10 18:34:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:34:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:34:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:34:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:34:15\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:34:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:34:56 -- Epoch: 0/0; Test; loss: 0.320; acc: 0.874; precision: 0.857, recall: 0.897, macrof1: 0.874, weightedf1: 0.874\u001b[0m\n",
      "--- 41.614906311035156 seconds ---\n",
      "--------- 1000\n",
      "\u001b[92m2020-09-10 18:34:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:34:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:34:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:34:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:34:57\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:35:41\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:35:41 -- Epoch: 0/0; Test; loss: 0.457; acc: 0.818; precision: 0.841, recall: 0.785, macrof1: 0.818, weightedf1: 0.818\u001b[0m\n",
      "--- 44.90234637260437 seconds ---\n",
      "--------- 2000\n",
      "\u001b[92m2020-09-10 18:35:41\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:35:41\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:35:41\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:35:41\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:35:42\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:36:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:36:25 -- Epoch: 0/0; Test; loss: 0.682; acc: 0.770; precision: 0.787, recall: 0.739, macrof1: 0.769, weightedf1: 0.769\u001b[0m\n",
      "--- 44.764214277267456 seconds ---\n",
      "--------- 4000\n",
      "\u001b[92m2020-09-10 18:36:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:36:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:36:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:36:26\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:36:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:37:09\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:37:09 -- Epoch: 0/0; Test; loss: 0.853; acc: 0.748; precision: 0.777, recall: 0.695, macrof1: 0.747, weightedf1: 0.747\u001b[0m\n",
      "--- 44.099066734313965 seconds ---\n",
      "--------- 8000\n",
      "\u001b[92m2020-09-10 18:37:09\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:37:09\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:37:09\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:37:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:37:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:37:58\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:37:58 -- Epoch: 0/0; Test; loss: 1.304; acc: 0.709; precision: 0.754, recall: 0.621, macrof1: 0.707, weightedf1: 0.707\u001b[0m\n",
      "--- 48.10148477554321 seconds ---\n",
      "--------- 16000\n",
      "\u001b[92m2020-09-10 18:37:58\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:37:58\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:37:58\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:37:58\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:37:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:38:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:38:45 -- Epoch: 0/0; Test; loss: 1.234; acc: 0.707; precision: 0.737, recall: 0.644, macrof1: 0.706, weightedf1: 0.706\u001b[0m\n",
      "--- 47.47420120239258 seconds ---\n",
      "--------- 32000\n",
      "\u001b[92m2020-09-10 18:38:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:38:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:38:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:38:46\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:38:46\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:39:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:39:30 -- Epoch: 0/0; Test; loss: 1.532; acc: 0.690; precision: 0.734, recall: 0.598, macrof1: 0.688, weightedf1: 0.688\u001b[0m\n",
      "--- 45.29329752922058 seconds ---\n",
      "--------- 64000\n",
      "\u001b[92m2020-09-10 18:39:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:39:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:39:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:39:31\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:39:31\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:40:18\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:40:18 -- Epoch: 0/0; Test; loss: 1.673; acc: 0.690; precision: 0.729, recall: 0.607, macrof1: 0.688, weightedf1: 0.688\u001b[0m\n",
      "--- 48.14600992202759 seconds ---\n",
      "--------- 84000\n",
      "\u001b[92m2020-09-10 18:40:18\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:40:18\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:40:19\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:40:19\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:40:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:41:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:41:07 -- Epoch: 0/0; Test; loss: 1.866; acc: 0.685; precision: 0.718, recall: 0.608, macrof1: 0.683, weightedf1: 0.683\u001b[0m\n",
      "--- 48.287535429000854 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "for n_ft_examples in [250, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 84000]:\n",
    "    print(\"---------\", n_ft_examples)\n",
    "    # model inference using a model stored at pretrained_model_path and pretrained_vocab_path\n",
    "    dm_inference(input_file_path=\"./inputs/input_dfm_gru_model_A.yaml\",\n",
    "                 dataset_path=\"./dataset/wikigaz_en_test.txt\",\n",
    "                 pretrained_model_path=f\"./models/wikigaz_en_ft_ocr_gru_v001_n{n_ft_examples}/wikigaz_en_ft_ocr_gru_v001_n{n_ft_examples}.model\",\n",
    "                 pretrained_vocab_path=f\"./models/wikigaz_en_ft_ocr_gru_v001_n{n_ft_examples}/wikigaz_en_ft_ocr_gru_v001_n{n_ft_examples}.vocab\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fine-Tuned, model A, LSTM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/8508 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------- 250\n",
      "\u001b[92m2020-09-10 18:45:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:45:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:45:28 -- Epoch: 0/0; Test; loss: 1.055; acc: 0.605; precision: 0.593, recall: 0.672, macrof1: 0.603, weightedf1: 0.603\u001b[0m\n",
      "--- 4.489645957946777 seconds ---\n",
      "--------- 500\n",
      "\u001b[92m2020-09-10 18:45:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:45:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:45:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:45:35 -- Epoch: 0/0; Test; loss: 0.791; acc: 0.691; precision: 0.680, recall: 0.722, macrof1: 0.691, weightedf1: 0.691\u001b[0m\n",
      "--- 6.434430360794067 seconds ---\n",
      "--------- 1000\n",
      "\u001b[92m2020-09-10 18:45:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:45:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:45:41\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:45:41 -- Epoch: 0/0; Test; loss: 0.588; acc: 0.763; precision: 0.757, recall: 0.774, macrof1: 0.763, weightedf1: 0.763\u001b[0m\n",
      "--- 6.529709815979004 seconds ---\n",
      "--------- 2000\n",
      "\u001b[92m2020-09-10 18:45:41\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:41\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:41\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:41\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:45:41\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:45:48\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:45:48 -- Epoch: 0/0; Test; loss: 0.457; acc: 0.812; precision: 0.806, recall: 0.824, macrof1: 0.812, weightedf1: 0.812\u001b[0m\n",
      "--- 6.572016477584839 seconds ---\n",
      "--------- 4000\n",
      "\u001b[92m2020-09-10 18:45:48\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:48\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:48\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:48\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:45:48\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:45:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:45:54 -- Epoch: 0/0; Test; loss: 0.400; acc: 0.835; precision: 0.831, recall: 0.840, macrof1: 0.835, weightedf1: 0.835\u001b[0m\n",
      "--- 6.511359691619873 seconds ---\n",
      "--------- 8000\n",
      "\u001b[92m2020-09-10 18:45:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:45:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:45:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:46:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:46:01 -- Epoch: 0/0; Test; loss: 0.339; acc: 0.863; precision: 0.860, recall: 0.867, macrof1: 0.863, weightedf1: 0.863\u001b[0m\n",
      "--- 6.453651428222656 seconds ---\n",
      "--------- 16000\n",
      "\u001b[92m2020-09-10 18:46:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:46:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:46:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:46:07 -- Epoch: 0/0; Test; loss: 0.281; acc: 0.892; precision: 0.884, recall: 0.904, macrof1: 0.892, weightedf1: 0.892\u001b[0m\n",
      "--- 6.4285571575164795 seconds ---\n",
      "--------- 32000\n",
      "\u001b[92m2020-09-10 18:46:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:46:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:46:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:46:14 -- Epoch: 0/0; Test; loss: 0.246; acc: 0.906; precision: 0.901, recall: 0.911, macrof1: 0.906, weightedf1: 0.906\u001b[0m\n",
      "--- 6.5677735805511475 seconds ---\n",
      "--------- 64000\n",
      "\u001b[92m2020-09-10 18:46:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:46:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:46:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:46:20 -- Epoch: 0/0; Test; loss: 0.202; acc: 0.923; precision: 0.916, recall: 0.931, macrof1: 0.923, weightedf1: 0.923\u001b[0m\n",
      "--- 6.509921312332153 seconds ---\n",
      "--------- 84000\n",
      "\u001b[92m2020-09-10 18:46:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:46:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:46:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:46:27 -- Epoch: 0/0; Test; loss: 0.187; acc: 0.931; precision: 0.928, recall: 0.935, macrof1: 0.931, weightedf1: 0.931\u001b[0m\n",
      "--- 6.3850836753845215 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate each fine-tuned LSTM model (model A) on the OCR test set.\n",
    "for n_ft_examples in [250, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 84000]:\n",
    "    print(\"---------\", n_ft_examples)\n",
    "    # model inference using a model stored at pretrained_model_path and pretrained_vocab_path\n",
    "    dm_inference(input_file_path=\"./inputs/input_dfm_lstm_model_A.yaml\",\n",
    "                 dataset_path=\"./dataset/ocr_test.txt\",\n",
    "                 pretrained_model_path=f\"./models/wikigaz_en_ft_ocr_lstm_v001_n{n_ft_examples}/wikigaz_en_ft_ocr_lstm_v001_n{n_ft_examples}.model\",\n",
    "                 pretrained_vocab_path=f\"./models/wikigaz_en_ft_ocr_lstm_v001_n{n_ft_examples}/wikigaz_en_ft_ocr_lstm_v001_n{n_ft_examples}.vocab\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------- 250\n",
      "\u001b[92m2020-09-10 18:46:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:46:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:46:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:47:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:47:24 -- Epoch: 0/0; Test; loss: 0.240; acc: 0.907; precision: 0.909, recall: 0.905, macrof1: 0.907, weightedf1: 0.907\u001b[0m\n",
      "--- 57.79513621330261 seconds ---\n",
      "--------- 500\n",
      "\u001b[92m2020-09-10 18:47:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:47:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:47:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:47:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:47:26\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:48:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:48:27 -- Epoch: 0/0; Test; loss: 0.333; acc: 0.876; precision: 0.883, recall: 0.867, macrof1: 0.876, weightedf1: 0.876\u001b[0m\n",
      "--- 62.41485595703125 seconds ---\n",
      "--------- 1000\n",
      "\u001b[92m2020-09-10 18:48:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:48:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:48:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:48:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:48:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:49:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:49:25 -- Epoch: 0/0; Test; loss: 0.504; acc: 0.826; precision: 0.854, recall: 0.788, macrof1: 0.826, weightedf1: 0.826\u001b[0m\n",
      "--- 58.05944013595581 seconds ---\n",
      "--------- 2000\n",
      "\u001b[92m2020-09-10 18:49:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:49:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:49:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:49:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:49:26\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:50:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:50:27 -- Epoch: 0/0; Test; loss: 0.689; acc: 0.787; precision: 0.825, recall: 0.730, macrof1: 0.787, weightedf1: 0.787\u001b[0m\n",
      "--- 62.48609185218811 seconds ---\n",
      "--------- 4000\n",
      "\u001b[92m2020-09-10 18:50:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:50:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:50:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:50:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:50:29\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:51:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:51:25 -- Epoch: 0/0; Test; loss: 0.725; acc: 0.775; precision: 0.812, recall: 0.718, macrof1: 0.775, weightedf1: 0.775\u001b[0m\n",
      "--- 57.90281414985657 seconds ---\n",
      "--------- 8000\n",
      "\u001b[92m2020-09-10 18:51:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:51:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:51:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:51:26\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:51:26\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:52:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:52:25 -- Epoch: 0/0; Test; loss: 0.871; acc: 0.746; precision: 0.776, recall: 0.690, macrof1: 0.745, weightedf1: 0.745\u001b[0m\n",
      "--- 59.80358624458313 seconds ---\n",
      "--------- 16000\n",
      "\u001b[92m2020-09-10 18:52:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:52:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:52:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:52:26\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:52:26\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:53:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:53:27 -- Epoch: 0/0; Test; loss: 1.314; acc: 0.709; precision: 0.737, recall: 0.652, macrof1: 0.708, weightedf1: 0.708\u001b[0m\n",
      "--- 61.88062405586243 seconds ---\n",
      "--------- 32000\n",
      "\u001b[92m2020-09-10 18:53:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:53:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:53:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:53:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:53:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:54:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:54:24 -- Epoch: 0/0; Test; loss: 1.387; acc: 0.713; precision: 0.753, recall: 0.634, macrof1: 0.711, weightedf1: 0.711\u001b[0m\n",
      "--- 57.35245490074158 seconds ---\n",
      "--------- 64000\n",
      "\u001b[92m2020-09-10 18:54:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:54:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:54:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:54:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:54:26\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:55:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:55:27 -- Epoch: 0/0; Test; loss: 1.594; acc: 0.708; precision: 0.757, recall: 0.613, macrof1: 0.706, weightedf1: 0.706\u001b[0m\n",
      "--- 62.685957193374634 seconds ---\n",
      "--------- 84000\n",
      "\u001b[92m2020-09-10 18:55:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:55:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:55:27\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:55:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:55:28\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:56:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:56:24 -- Epoch: 0/0; Test; loss: 1.829; acc: 0.705; precision: 0.758, recall: 0.603, macrof1: 0.702, weightedf1: 0.702\u001b[0m\n",
      "--- 57.40061807632446 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate the same fine-tuned LSTM models (model A) on the WG:en test set.\n",
    "for n_ft_examples in [250, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 84000]:\n",
    "    print(\"---------\", n_ft_examples)\n",
    "    # model inference using a model stored at pretrained_model_path and pretrained_vocab_path\n",
    "    dm_inference(input_file_path=\"./inputs/input_dfm_lstm_model_A.yaml\",\n",
    "                 dataset_path=\"./dataset/wikigaz_en_test.txt\",\n",
    "                 pretrained_model_path=f\"./models/wikigaz_en_ft_ocr_lstm_v001_n{n_ft_examples}/wikigaz_en_ft_ocr_lstm_v001_n{n_ft_examples}.model\",\n",
    "                 pretrained_vocab_path=f\"./models/wikigaz_en_ft_ocr_lstm_v001_n{n_ft_examples}/wikigaz_en_ft_ocr_lstm_v001_n{n_ft_examples}.vocab\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fine-Tuned, model A, RNN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------- 250\n",
      "\u001b[92m2020-09-10 18:56:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:56:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:56:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:56:30 -- Epoch: 0/0; Test; loss: 0.759; acc: 0.593; precision: 0.587, recall: 0.626, macrof1: 0.593, weightedf1: 0.593\u001b[0m\n",
      "--- 5.469081401824951 seconds ---\n",
      "--------- 500\n",
      "\u001b[92m2020-09-10 18:56:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:56:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:56:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:56:35 -- Epoch: 0/0; Test; loss: 0.636; acc: 0.691; precision: 0.691, recall: 0.691, macrof1: 0.691, weightedf1: 0.691\u001b[0m\n",
      "--- 5.20325493812561 seconds ---\n",
      "--------- 1000\n",
      "\u001b[92m2020-09-10 18:56:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:56:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:56:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:56:40 -- Epoch: 0/0; Test; loss: 0.583; acc: 0.727; precision: 0.716, recall: 0.752, macrof1: 0.727, weightedf1: 0.727\u001b[0m\n",
      "--- 5.0874879360198975 seconds ---\n",
      "--------- 2000\n",
      "\u001b[92m2020-09-10 18:56:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:56:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:56:46\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:56:46 -- Epoch: 0/0; Test; loss: 0.538; acc: 0.752; precision: 0.725, recall: 0.813, macrof1: 0.751, weightedf1: 0.751\u001b[0m\n",
      "--- 5.983654022216797 seconds ---\n",
      "--------- 4000\n",
      "\u001b[92m2020-09-10 18:56:46\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:46\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:46\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:56:47\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:47\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:56:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:56:53 -- Epoch: 0/0; Test; loss: 0.470; acc: 0.783; precision: 0.776, recall: 0.795, macrof1: 0.783, weightedf1: 0.783\u001b[0m\n",
      "--- 6.274932622909546 seconds ---\n",
      "--------- 8000\n",
      "\u001b[92m2020-09-10 18:56:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:56:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:56:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:56:59 -- Epoch: 0/0; Test; loss: 0.415; acc: 0.816; precision: 0.796, recall: 0.850, macrof1: 0.816, weightedf1: 0.816\u001b[0m\n",
      "--- 6.120548963546753 seconds ---\n",
      "--------- 16000\n",
      "\u001b[92m2020-09-10 18:56:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:56:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:56:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:57:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:57:05 -- Epoch: 0/0; Test; loss: 0.377; acc: 0.838; precision: 0.802, recall: 0.897, macrof1: 0.837, weightedf1: 0.837\u001b[0m\n",
      "--- 6.093186855316162 seconds ---\n",
      "--------- 32000\n",
      "\u001b[92m2020-09-10 18:57:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:57:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:57:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:57:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 18:57:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:57:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:57:11 -- Epoch: 0/0; Test; loss: 0.346; acc: 0.851; precision: 0.845, recall: 0.860, macrof1: 0.851, weightedf1: 0.851\u001b[0m\n",
      "--- 6.3465282917022705 seconds ---\n",
      "--------- 64000\n",
      "\u001b[92m2020-09-10 18:57:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:57:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:57:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:57:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:57:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:57:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:57:17 -- Epoch: 0/0; Test; loss: 0.325; acc: 0.865; precision: 0.844, recall: 0.894, macrof1: 0.864, weightedf1: 0.864\u001b[0m\n",
      "--- 6.091903209686279 seconds ---\n",
      "--------- 84000\n",
      "\u001b[92m2020-09-10 18:57:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:57:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:57:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:57:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:57:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:57:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:57:23 -- Epoch: 0/0; Test; loss: 0.306; acc: 0.873; precision: 0.861, recall: 0.889, macrof1: 0.873, weightedf1: 0.873\u001b[0m\n",
      "--- 6.133760690689087 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "for n_ft_examples in [250, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 84000]:\n",
    "    print(\"---------\", n_ft_examples)\n",
    "    # Model inference using the model stored at pretrained_model_path and the vocabulary at pretrained_vocab_path\n",
    "    dm_inference(input_file_path=\"./inputs/input_dfm_rnn_model_A.yaml\",\n",
    "                 dataset_path=\"./dataset/ocr_test.txt\",\n",
    "                 pretrained_model_path=f\"./models/wikigaz_en_ft_ocr_rnn_v001_n{n_ft_examples}/wikigaz_en_ft_ocr_rnn_v001_n{n_ft_examples}.model\",\n",
    "                 pretrained_vocab_path=f\"./models/wikigaz_en_ft_ocr_rnn_v001_n{n_ft_examples}/wikigaz_en_ft_ocr_rnn_v001_n{n_ft_examples}.vocab\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------- 250\n",
      "\u001b[92m2020-09-10 18:57:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:57:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:57:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:57:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:57:25\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:58:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:58:10 -- Epoch: 0/0; Test; loss: 0.320; acc: 0.858; precision: 0.863, recall: 0.851, macrof1: 0.858, weightedf1: 0.858\u001b[0m\n",
      "--- 46.1121187210083 seconds ---\n",
      "--------- 500\n",
      "\u001b[92m2020-09-10 18:58:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:58:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:58:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:58:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:58:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:59:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:59:04 -- Epoch: 0/0; Test; loss: 0.550; acc: 0.776; precision: 0.805, recall: 0.727, macrof1: 0.775, weightedf1: 0.775\u001b[0m\n",
      "--- 54.828999280929565 seconds ---\n",
      "--------- 1000\n",
      "\u001b[92m2020-09-10 18:59:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:59:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:59:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 18:59:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:59:06\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 18:59:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_18:59:59 -- Epoch: 0/0; Test; loss: 0.557; acc: 0.765; precision: 0.784, recall: 0.731, macrof1: 0.764, weightedf1: 0.764\u001b[0m\n",
      "--- 54.77786135673523 seconds ---\n",
      "--------- 2000\n",
      "\u001b[92m2020-09-10 18:59:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 18:59:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 18:59:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 19:00:00\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 19:00:00\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 19:00:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_19:00:55 -- Epoch: 0/0; Test; loss: 0.583; acc: 0.754; precision: 0.771, recall: 0.723, macrof1: 0.754, weightedf1: 0.754\u001b[0m\n",
      "--- 55.344016551971436 seconds ---\n",
      "--------- 4000\n",
      "\u001b[92m2020-09-10 19:00:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 19:00:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 19:00:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 19:00:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 19:00:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 19:01:49\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_19:01:49 -- Epoch: 0/0; Test; loss: 0.758; acc: 0.726; precision: 0.767, recall: 0.650, macrof1: 0.724, weightedf1: 0.724\u001b[0m\n",
      "--- 54.62463927268982 seconds ---\n",
      "--------- 8000\n",
      "\u001b[92m2020-09-10 19:01:49\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 19:01:49\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 19:01:49\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 19:01:50\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 19:01:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 19:02:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_19:02:45 -- Epoch: 0/0; Test; loss: 1.096; acc: 0.681; precision: 0.721, recall: 0.592, macrof1: 0.679, weightedf1: 0.679\u001b[0m\n",
      "--- 55.372541189193726 seconds ---\n",
      "--------- 16000\n",
      "\u001b[92m2020-09-10 19:02:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 19:02:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 19:02:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 19:02:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 19:02:46\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 19:03:39\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_19:03:39 -- Epoch: 0/0; Test; loss: 1.301; acc: 0.669; precision: 0.702, recall: 0.587, macrof1: 0.667, weightedf1: 0.667\u001b[0m\n",
      "--- 54.78247356414795 seconds ---\n",
      "--------- 32000\n",
      "\u001b[92m2020-09-10 19:03:39\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 19:03:39\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 19:03:39\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 19:03:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 19:03:41\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 19:04:34\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_19:04:34 -- Epoch: 0/0; Test; loss: 1.453; acc: 0.637; precision: 0.685, recall: 0.507, macrof1: 0.631, weightedf1: 0.631\u001b[0m\n",
      "--- 54.39713954925537 seconds ---\n",
      "--------- 64000\n",
      "\u001b[92m2020-09-10 19:04:34\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 19:04:34\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 19:04:34\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 19:04:34\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 19:04:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 19:05:29\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_19:05:29 -- Epoch: 0/0; Test; loss: 1.530; acc: 0.662; precision: 0.702, recall: 0.564, macrof1: 0.659, weightedf1: 0.659\u001b[0m\n",
      "--- 55.21286368370056 seconds ---\n",
      "--------- 84000\n",
      "\u001b[92m2020-09-10 19:05:29\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_A.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 19:05:29\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 19:05:29\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 19:05:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 19:05:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 19:06:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_19:06:24 -- Epoch: 0/0; Test; loss: 1.523; acc: 0.656; precision: 0.702, recall: 0.543, macrof1: 0.652, weightedf1: 0.652\u001b[0m\n",
      "--- 54.83734583854675 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "for n_ft_examples in [250, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 84000]:\n",
    "    print(\"---------\", n_ft_examples)\n",
    "    # Evaluate the fine-tuned model A (RNN) checkpoint for this training-set size on the WG:en test set\n",
    "    dm_inference(input_file_path=\"./inputs/input_dfm_rnn_model_A.yaml\",\n",
    "                 dataset_path=\"./dataset/wikigaz_en_test.txt\",\n",
    "                 pretrained_model_path=f\"./models/wikigaz_en_ft_ocr_rnn_v001_n{n_ft_examples}/wikigaz_en_ft_ocr_rnn_v001_n{n_ft_examples}.model\",\n",
    "                 pretrained_vocab_path=f\"./models/wikigaz_en_ft_ocr_rnn_v001_n{n_ft_examples}/wikigaz_en_ft_ocr_rnn_v001_n{n_ft_examples}.vocab\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fine-Tuned, model B, GRU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "s1 padding:   0%|          | 0/8508 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------- 250\n",
      "\u001b[92m2020-09-10 21:45:47\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:45:47\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:45:47\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:45:47\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 21:45:47\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:45:50\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:45:50 -- Epoch: 0/0; Test; loss: 0.953; acc: 0.578; precision: 0.567, recall: 0.661, macrof1: 0.575, weightedf1: 0.575\u001b[0m\n",
      "--- 3.2879772186279297 seconds ---\n",
      "--------- 500\n",
      "\u001b[92m2020-09-10 21:45:50\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:45:50\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:45:50\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:45:50\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:45:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:45:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:45:54 -- Epoch: 0/0; Test; loss: 0.753; acc: 0.637; precision: 0.621, recall: 0.701, macrof1: 0.635, weightedf1: 0.635\u001b[0m\n",
      "--- 3.240983009338379 seconds ---\n",
      "--------- 1000\n",
      "\u001b[92m2020-09-10 21:45:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:45:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:45:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:45:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:45:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:45:57\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:45:57 -- Epoch: 0/0; Test; loss: 0.615; acc: 0.730; precision: 0.715, recall: 0.764, macrof1: 0.730, weightedf1: 0.730\u001b[0m\n",
      "--- 3.3224997520446777 seconds ---\n",
      "--------- 2000\n",
      "\u001b[92m2020-09-10 21:45:57\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:45:57\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:45:57\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:45:57\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:45:57\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:00\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:46:00 -- Epoch: 0/0; Test; loss: 0.506; acc: 0.792; precision: 0.788, recall: 0.799, macrof1: 0.792, weightedf1: 0.792\u001b[0m\n",
      "--- 3.4533793926239014 seconds ---\n",
      "--------- 4000\n",
      "\u001b[92m2020-09-10 21:46:00\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:00\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:00\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:46:04 -- Epoch: 0/0; Test; loss: 0.397; acc: 0.846; precision: 0.844, recall: 0.849, macrof1: 0.846, weightedf1: 0.846\u001b[0m\n",
      "--- 3.291226863861084 seconds ---\n",
      "--------- 8000\n",
      "\u001b[92m2020-09-10 21:46:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:46:07 -- Epoch: 0/0; Test; loss: 0.299; acc: 0.886; precision: 0.880, recall: 0.894, macrof1: 0.886, weightedf1: 0.886\u001b[0m\n",
      "--- 3.325597047805786 seconds ---\n",
      "--------- 16000\n",
      "\u001b[92m2020-09-10 21:46:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:07\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:46:10 -- Epoch: 0/0; Test; loss: 0.230; acc: 0.915; precision: 0.907, recall: 0.925, macrof1: 0.915, weightedf1: 0.915\u001b[0m\n",
      "--- 3.320563554763794 seconds ---\n",
      "--------- 32000\n",
      "\u001b[92m2020-09-10 21:46:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:10\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:46:14 -- Epoch: 0/0; Test; loss: 0.175; acc: 0.935; precision: 0.936, recall: 0.934, macrof1: 0.935, weightedf1: 0.935\u001b[0m\n",
      "--- 3.4649336338043213 seconds ---\n",
      "--------- 64000\n",
      "\u001b[92m2020-09-10 21:46:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:14\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:46:17 -- Epoch: 0/0; Test; loss: 0.146; acc: 0.951; precision: 0.963, recall: 0.939, macrof1: 0.951, weightedf1: 0.951\u001b[0m\n",
      "--- 3.277069330215454 seconds ---\n",
      "--------- 84000\n",
      "\u001b[92m2020-09-10 21:46:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:46:20 -- Epoch: 0/0; Test; loss: 0.122; acc: 0.959; precision: 0.951, recall: 0.967, macrof1: 0.959, weightedf1: 0.959\u001b[0m\n",
      "--- 3.261852979660034 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "for n_ft_examples in [250, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 84000]:\n",
    "    print(\"---------\", n_ft_examples)\n",
    "    # Evaluate the fine-tuned model B (GRU) checkpoint for this training-set size on the OCR test set\n",
    "    dm_inference(input_file_path=\"./inputs/input_dfm_gru_model_B.yaml\",\n",
    "                 dataset_path=\"./dataset/ocr_test.txt\",\n",
    "                 pretrained_model_path=f\"./models/wikigaz_en_ft_ocr_gru_v002_n{n_ft_examples}/wikigaz_en_ft_ocr_gru_v002_n{n_ft_examples}.model\",\n",
    "                 pretrained_vocab_path=f\"./models/wikigaz_en_ft_ocr_gru_v002_n{n_ft_examples}/wikigaz_en_ft_ocr_gru_v002_n{n_ft_examples}.vocab\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------- 250\n",
      "\u001b[92m2020-09-10 21:46:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:20\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:21\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:22\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:46:51 -- Epoch: 0/0; Test; loss: 0.212; acc: 0.913; precision: 0.887, recall: 0.948, macrof1: 0.913, weightedf1: 0.913\u001b[0m\n",
      "--- 30.192748546600342 seconds ---\n",
      "--------- 500\n",
      "\u001b[92m2020-09-10 21:46:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:46:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:46:52\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:47:21\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:47:21 -- Epoch: 0/0; Test; loss: 0.247; acc: 0.899; precision: 0.868, recall: 0.941, macrof1: 0.899, weightedf1: 0.899\u001b[0m\n",
      "--- 30.098008632659912 seconds ---\n",
      "--------- 1000\n",
      "\u001b[92m2020-09-10 21:47:21\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:47:21\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:47:21\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:47:21\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:47:22\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:47:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:47:51 -- Epoch: 0/0; Test; loss: 0.337; acc: 0.865; precision: 0.854, recall: 0.881, macrof1: 0.865, weightedf1: 0.865\u001b[0m\n",
      "--- 30.384669303894043 seconds ---\n",
      "--------- 2000\n",
      "\u001b[92m2020-09-10 21:47:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:47:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:47:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:47:52\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:47:52\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:48:21\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:48:21 -- Epoch: 0/0; Test; loss: 0.454; acc: 0.823; precision: 0.827, recall: 0.816, macrof1: 0.823, weightedf1: 0.823\u001b[0m\n",
      "--- 30.343998193740845 seconds ---\n",
      "--------- 4000\n",
      "\u001b[92m2020-09-10 21:48:21\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:48:22\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:48:22\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:48:22\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:48:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:48:52\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:48:52 -- Epoch: 0/0; Test; loss: 0.685; acc: 0.779; precision: 0.800, recall: 0.744, macrof1: 0.778, weightedf1: 0.778\u001b[0m\n",
      "--- 30.623249769210815 seconds ---\n",
      "--------- 8000\n",
      "\u001b[92m2020-09-10 21:48:52\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:48:52\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:48:52\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:48:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:48:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:49:22\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:49:22 -- Epoch: 0/0; Test; loss: 0.966; acc: 0.735; precision: 0.768, recall: 0.674, macrof1: 0.734, weightedf1: 0.734\u001b[0m\n",
      "--- 30.20074987411499 seconds ---\n",
      "--------- 16000\n",
      "\u001b[92m2020-09-10 21:49:22\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:49:22\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:49:22\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:49:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:49:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:49:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:49:53 -- Epoch: 0/0; Test; loss: 1.193; acc: 0.712; precision: 0.751, recall: 0.633, macrof1: 0.710, weightedf1: 0.710\u001b[0m\n",
      "--- 30.544854164123535 seconds ---\n",
      "--------- 32000\n",
      "\u001b[92m2020-09-10 21:49:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:49:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:49:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:49:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:49:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:50:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:50:23 -- Epoch: 0/0; Test; loss: 1.727; acc: 0.703; precision: 0.767, recall: 0.584, macrof1: 0.699, weightedf1: 0.699\u001b[0m\n",
      "--- 29.904095888137817 seconds ---\n",
      "--------- 64000\n",
      "\u001b[92m2020-09-10 21:50:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:50:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:50:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:50:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:50:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:50:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:50:53 -- Epoch: 0/0; Test; loss: 1.546; acc: 0.722; precision: 0.801, recall: 0.590, macrof1: 0.717, weightedf1: 0.717\u001b[0m\n",
      "--- 30.084099531173706 seconds ---\n",
      "--------- 84000\n",
      "\u001b[92m2020-09-10 21:50:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_gru_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:50:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:50:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:50:53\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:50:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:22\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:51:22 -- Epoch: 0/0; Test; loss: 1.654; acc: 0.709; precision: 0.769, recall: 0.598, macrof1: 0.705, weightedf1: 0.705\u001b[0m\n",
      "--- 29.485949993133545 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "for n_ft_examples in [250, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 84000]:\n",
    "    print(\"---------\", n_ft_examples)\n",
    "    # Run inference with the model fine-tuned on n_ft_examples instances;\n",
    "    # the model and its vocabulary are loaded from pretrained_model_path and pretrained_vocab_path.\n",
    "    dm_inference(input_file_path=\"./inputs/input_dfm_gru_model_B.yaml\",\n",
    "                 dataset_path=\"./dataset/wikigaz_en_test.txt\",\n",
    "                 pretrained_model_path=f\"./models/wikigaz_en_ft_ocr_gru_v002_n{n_ft_examples}/wikigaz_en_ft_ocr_gru_v002_n{n_ft_examples}.model\",\n",
    "                 pretrained_vocab_path=f\"./models/wikigaz_en_ft_ocr_gru_v002_n{n_ft_examples}/wikigaz_en_ft_ocr_gru_v002_n{n_ft_examples}.vocab\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fine-Tuned, model B, LSTM"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------- 250\n",
      "\u001b[92m2020-09-10 21:51:22\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:22\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:22\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:26\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:51:26 -- Epoch: 0/0; Test; loss: 1.083; acc: 0.594; precision: 0.581, recall: 0.675, macrof1: 0.591, weightedf1: 0.591\u001b[0m\n",
      "--- 3.68133282661438 seconds ---\n",
      "--------- 500\n",
      "\u001b[92m2020-09-10 21:51:26\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:26\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:26\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:26\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:26\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:51:30 -- Epoch: 0/0; Test; loss: 0.882; acc: 0.646; precision: 0.631, recall: 0.703, macrof1: 0.645, weightedf1: 0.645\u001b[0m\n",
      "--- 3.484748601913452 seconds ---\n",
      "--------- 1000\n",
      "\u001b[92m2020-09-10 21:51:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:30\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:33\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:51:33 -- Epoch: 0/0; Test; loss: 0.651; acc: 0.735; precision: 0.728, recall: 0.750, macrof1: 0.735, weightedf1: 0.735\u001b[0m\n",
      "--- 3.5071098804473877 seconds ---\n",
      "--------- 2000\n",
      "\u001b[92m2020-09-10 21:51:33\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:33\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:33\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:33\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:33\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:37\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:51:37 -- Epoch: 0/0; Test; loss: 0.493; acc: 0.806; precision: 0.799, recall: 0.818, macrof1: 0.806, weightedf1: 0.806\u001b[0m\n",
      "--- 3.550307273864746 seconds ---\n",
      "--------- 4000\n",
      "\u001b[92m2020-09-10 21:51:37\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:37\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:37\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:37\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:37\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:51:40 -- Epoch: 0/0; Test; loss: 0.371; acc: 0.856; precision: 0.864, recall: 0.845, macrof1: 0.856, weightedf1: 0.856\u001b[0m\n",
      "--- 3.6087193489074707 seconds ---\n",
      "--------- 8000\n",
      "\u001b[92m2020-09-10 21:51:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:40\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:51:44 -- Epoch: 0/0; Test; loss: 0.263; acc: 0.902; precision: 0.886, recall: 0.921, macrof1: 0.901, weightedf1: 0.901\u001b[0m\n",
      "--- 3.484403371810913 seconds ---\n",
      "--------- 16000\n",
      "\u001b[92m2020-09-10 21:51:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:47\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:51:47 -- Epoch: 0/0; Test; loss: 0.205; acc: 0.925; precision: 0.921, recall: 0.929, macrof1: 0.925, weightedf1: 0.925\u001b[0m\n",
      "--- 3.4361681938171387 seconds ---\n",
      "--------- 32000\n",
      "\u001b[92m2020-09-10 21:51:47\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:47\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:47\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:47\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:48\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:51:51 -- Epoch: 0/0; Test; loss: 0.164; acc: 0.944; precision: 0.933, recall: 0.957, macrof1: 0.944, weightedf1: 0.944\u001b[0m\n",
      "--- 3.6248276233673096 seconds ---\n",
      "--------- 64000\n",
      "\u001b[92m2020-09-10 21:51:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:51:54 -- Epoch: 0/0; Test; loss: 0.136; acc: 0.956; precision: 0.939, recall: 0.975, macrof1: 0.956, weightedf1: 0.956\u001b[0m\n",
      "--- 3.5730488300323486 seconds ---\n",
      "--------- 84000\n",
      "\u001b[92m2020-09-10 21:51:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:54\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:58\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:51:58 -- Epoch: 0/0; Test; loss: 0.115; acc: 0.962; precision: 0.954, recall: 0.970, macrof1: 0.962, weightedf1: 0.962\u001b[0m\n",
      "--- 3.5030529499053955 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "for n_ft_examples in [250, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 84000]:\n",
    "    print(\"---------\", n_ft_examples)\n",
    "    # Run inference on the OCR test set with the LSTM model B checkpoint\n",
    "    # fine-tuned on n_ft_examples OCR training instances.\n",
    "    dm_inference(\n",
    "        input_file_path=\"./inputs/input_dfm_lstm_model_B.yaml\",\n",
    "        dataset_path=\"./dataset/ocr_test.txt\",\n",
    "        pretrained_model_path=f\"./models/wikigaz_en_ft_ocr_lstm_v002_n{n_ft_examples}/wikigaz_en_ft_ocr_lstm_v002_n{n_ft_examples}.model\",\n",
    "        pretrained_vocab_path=f\"./models/wikigaz_en_ft_ocr_lstm_v002_n{n_ft_examples}/wikigaz_en_ft_ocr_lstm_v002_n{n_ft_examples}.vocab\",\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------- 250\n",
      "\u001b[92m2020-09-10 21:51:58\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:58\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:58\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:51:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:51:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:52:31\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:52:31 -- Epoch: 0/0; Test; loss: 0.210; acc: 0.914; precision: 0.903, recall: 0.928, macrof1: 0.914, weightedf1: 0.914\u001b[0m\n",
      "--- 33.10648441314697 seconds ---\n",
      "--------- 500\n",
      "\u001b[92m2020-09-10 21:52:31\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:52:31\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:52:31\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:52:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:52:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:53:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:53:04 -- Epoch: 0/0; Test; loss: 0.239; acc: 0.903; precision: 0.894, recall: 0.915, macrof1: 0.903, weightedf1: 0.903\u001b[0m\n",
      "--- 32.95087957382202 seconds ---\n",
      "--------- 1000\n",
      "\u001b[92m2020-09-10 21:53:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:53:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:53:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:53:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:53:05\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:53:37\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:53:37 -- Epoch: 0/0; Test; loss: 0.310; acc: 0.879; precision: 0.888, recall: 0.867, macrof1: 0.879, weightedf1: 0.879\u001b[0m\n",
      "--- 33.19973707199097 seconds ---\n",
      "--------- 2000\n",
      "\u001b[92m2020-09-10 21:53:37\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:53:37\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:53:37\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:53:38\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:53:38\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:54:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:54:11 -- Epoch: 0/0; Test; loss: 0.445; acc: 0.839; precision: 0.855, recall: 0.815, macrof1: 0.838, weightedf1: 0.838\u001b[0m\n",
      "--- 33.24075388908386 seconds ---\n",
      "--------- 4000\n",
      "\u001b[92m2020-09-10 21:54:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:54:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:54:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:54:11\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:54:12\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:54:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:54:44 -- Epoch: 0/0; Test; loss: 0.633; acc: 0.804; precision: 0.845, recall: 0.744, macrof1: 0.803, weightedf1: 0.803\u001b[0m\n",
      "--- 33.05677127838135 seconds ---\n",
      "--------- 8000\n",
      "\u001b[92m2020-09-10 21:54:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:54:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:54:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:54:44\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:54:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:55:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:55:17 -- Epoch: 0/0; Test; loss: 0.861; acc: 0.770; precision: 0.794, recall: 0.728, macrof1: 0.769, weightedf1: 0.769\u001b[0m\n",
      "--- 32.96851706504822 seconds ---\n",
      "--------- 16000\n",
      "\u001b[92m2020-09-10 21:55:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:55:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:55:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:55:17\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:55:18\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:55:50\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:55:50 -- Epoch: 0/0; Test; loss: 1.204; acc: 0.743; precision: 0.781, recall: 0.676, macrof1: 0.742, weightedf1: 0.742\u001b[0m\n",
      "--- 33.30704116821289 seconds ---\n",
      "--------- 32000\n",
      "\u001b[92m2020-09-10 21:55:50\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:55:50\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:55:50\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:55:50\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:55:51\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:56:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:56:23 -- Epoch: 0/0; Test; loss: 1.205; acc: 0.735; precision: 0.779, recall: 0.656, macrof1: 0.733, weightedf1: 0.733\u001b[0m\n",
      "--- 32.63356304168701 seconds ---\n",
      "--------- 64000\n",
      "\u001b[92m2020-09-10 21:56:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:56:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:56:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:56:23\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:56:24\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:56:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:56:56 -- Epoch: 0/0; Test; loss: 1.579; acc: 0.716; precision: 0.766, recall: 0.623, macrof1: 0.714, weightedf1: 0.714\u001b[0m\n",
      "--- 33.02063274383545 seconds ---\n",
      "--------- 84000\n",
      "\u001b[92m2020-09-10 21:56:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_lstm_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:56:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:56:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:56:56\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:56:57\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:29\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:57:29 -- Epoch: 0/0; Test; loss: 1.594; acc: 0.730; precision: 0.787, recall: 0.631, macrof1: 0.727, weightedf1: 0.727\u001b[0m\n",
      "--- 32.9434278011322 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "for n_ft_examples in [250, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 84000]:\n",
    "    print(\"---------\", n_ft_examples)\n",
    "    # Run inference on the WG:en test set with the LSTM model B checkpoint\n",
    "    # fine-tuned on n_ft_examples OCR training instances.\n",
    "    dm_inference(\n",
    "        input_file_path=\"./inputs/input_dfm_lstm_model_B.yaml\",\n",
    "        dataset_path=\"./dataset/wikigaz_en_test.txt\",\n",
    "        pretrained_model_path=f\"./models/wikigaz_en_ft_ocr_lstm_v002_n{n_ft_examples}/wikigaz_en_ft_ocr_lstm_v002_n{n_ft_examples}.model\",\n",
    "        pretrained_vocab_path=f\"./models/wikigaz_en_ft_ocr_lstm_v002_n{n_ft_examples}/wikigaz_en_ft_ocr_lstm_v002_n{n_ft_examples}.vocab\",\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fine-Tuned, model B, RNN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------- 250\n",
      "\u001b[92m2020-09-10 21:57:29\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:29\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:29\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:29\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:29\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:57:32 -- Epoch: 0/0; Test; loss: 0.840; acc: 0.555; precision: 0.546, recall: 0.652, macrof1: 0.550, weightedf1: 0.550\u001b[0m\n",
      "--- 3.487182378768921 seconds ---\n",
      "--------- 500\n",
      "\u001b[92m2020-09-10 21:57:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:57:35 -- Epoch: 0/0; Test; loss: 0.696; acc: 0.590; precision: 0.577, recall: 0.675, macrof1: 0.587, weightedf1: 0.587\u001b[0m\n",
      "--- 3.2782790660858154 seconds ---\n",
      "--------- 1000\n",
      "\u001b[92m2020-09-10 21:57:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:35\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:39\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:57:39 -- Epoch: 0/0; Test; loss: 0.632; acc: 0.693; precision: 0.699, recall: 0.677, macrof1: 0.693, weightedf1: 0.693\u001b[0m\n",
      "--- 3.2303292751312256 seconds ---\n",
      "--------- 2000\n",
      "\u001b[92m2020-09-10 21:57:39\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:39\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:39\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:39\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:39\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:42\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:57:42 -- Epoch: 0/0; Test; loss: 0.535; acc: 0.767; precision: 0.762, recall: 0.778, macrof1: 0.767, weightedf1: 0.767\u001b[0m\n",
      "--- 3.4540255069732666 seconds ---\n",
      "--------- 4000\n",
      "\u001b[92m2020-09-10 21:57:42\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:42\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:42\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:42\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:42\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:57:45 -- Epoch: 0/0; Test; loss: 0.447; acc: 0.803; precision: 0.803, recall: 0.804, macrof1: 0.803, weightedf1: 0.803\u001b[0m\n",
      "--- 3.2438724040985107 seconds ---\n",
      "--------- 8000\n",
      "\u001b[92m2020-09-10 21:57:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:45\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:49\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:57:49 -- Epoch: 0/0; Test; loss: 0.367; acc: 0.845; precision: 0.826, recall: 0.875, macrof1: 0.845, weightedf1: 0.845\u001b[0m\n",
      "--- 3.2652595043182373 seconds ---\n",
      "--------- 16000\n",
      "\u001b[92m2020-09-10 21:57:49\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:49\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:49\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:49\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:49\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:52\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:57:52 -- Epoch: 0/0; Test; loss: 0.304; acc: 0.882; precision: 0.880, recall: 0.885, macrof1: 0.882, weightedf1: 0.882\u001b[0m\n",
      "--- 3.279560089111328 seconds ---\n",
      "--------- 32000\n",
      "\u001b[92m2020-09-10 21:57:52\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:52\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:52\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:52\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:52\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:57:55 -- Epoch: 0/0; Test; loss: 0.236; acc: 0.906; precision: 0.900, recall: 0.913, macrof1: 0.906, weightedf1: 0.906\u001b[0m\n",
      "--- 3.399199962615967 seconds ---\n",
      "--------- 64000\n",
      "\u001b[92m2020-09-10 21:57:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:55\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:57:59 -- Epoch: 0/0; Test; loss: 0.199; acc: 0.923; precision: 0.911, recall: 0.937, macrof1: 0.923, weightedf1: 0.923\u001b[0m\n",
      "--- 3.2592015266418457 seconds ---\n",
      "--------- 84000\n",
      "\u001b[92m2020-09-10 21:57:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/ocr_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:57:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 4254 and False: 4254\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                    "
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:57:59\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=133.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:58:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:58:02 -- Epoch: 0/0; Test; loss: 0.169; acc: 0.937; precision: 0.929, recall: 0.947, macrof1: 0.937, weightedf1: 0.937\u001b[0m\n",
      "--- 3.2660605907440186 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import finetune as dm_finetune\n",
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "# Evaluate each fine-tuned model (one per fine-tuning set size) on the OCR test set\n",
    "for n_ft_examples in [250, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 84000]:\n",
    "    print(\"---------\", n_ft_examples)\n",
    "    # model inference using a model stored at pretrained_model_path and pretrained_vocab_path\n",
    "    model_name = f\"wikigaz_en_ft_ocr_rnn_v002_n{n_ft_examples}\"\n",
    "    dm_inference(input_file_path=\"./inputs/input_dfm_rnn_model_B.yaml\",\n",
    "                 dataset_path=\"./dataset/ocr_test.txt\",\n",
    "                 pretrained_model_path=f\"./models/{model_name}/{model_name}.model\",\n",
    "                 pretrained_vocab_path=f\"./models/{model_name}/{model_name}.vocab\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--------- 250\n",
      "\u001b[92m2020-09-10 21:58:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:58:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:58:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:58:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:58:03\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:58:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:58:32 -- Epoch: 0/0; Test; loss: 0.264; acc: 0.891; precision: 0.877, recall: 0.910, macrof1: 0.891, weightedf1: 0.891\u001b[0m\n",
      "--- 30.13415813446045 seconds ---\n",
      "--------- 500\n",
      "\u001b[92m2020-09-10 21:58:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:58:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:58:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:58:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:58:33\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:59:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:59:02 -- Epoch: 0/0; Test; loss: 0.308; acc: 0.877; precision: 0.871, recall: 0.885, macrof1: 0.877, weightedf1: 0.877\u001b[0m\n",
      "--- 30.00017547607422 seconds ---\n",
      "--------- 1000\n",
      "\u001b[92m2020-09-10 21:59:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:59:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:59:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:59:03\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:59:03\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:59:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_21:59:32 -- Epoch: 0/0; Test; loss: 0.373; acc: 0.822; precision: 0.865, recall: 0.762, macrof1: 0.821, weightedf1: 0.821\u001b[0m\n",
      "--- 30.187791109085083 seconds ---\n",
      "--------- 2000\n",
      "\u001b[92m2020-09-10 21:59:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 21:59:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 21:59:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 21:59:33\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 21:59:33\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:00:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_22:00:02 -- Epoch: 0/0; Test; loss: 0.595; acc: 0.763; precision: 0.806, recall: 0.692, macrof1: 0.762, weightedf1: 0.762\u001b[0m\n",
      "--- 29.946187496185303 seconds ---\n",
      "--------- 4000\n",
      "\u001b[92m2020-09-10 22:00:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 22:00:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 22:00:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 22:00:03\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:00:04\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:00:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_22:00:32 -- Epoch: 0/0; Test; loss: 0.660; acc: 0.752; precision: 0.791, recall: 0.684, macrof1: 0.751, weightedf1: 0.751\u001b[0m\n",
      "--- 30.19029951095581 seconds ---\n",
      "--------- 8000\n",
      "\u001b[92m2020-09-10 22:00:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 22:00:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 22:00:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 22:00:33\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:00:34\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:01:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_22:01:02 -- Epoch: 0/0; Test; loss: 0.844; acc: 0.728; precision: 0.757, recall: 0.670, macrof1: 0.727, weightedf1: 0.727\u001b[0m\n",
      "--- 29.614692211151123 seconds ---\n",
      "--------- 16000\n",
      "\u001b[92m2020-09-10 22:01:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 22:01:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 22:01:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 22:01:03\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:01:03\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:01:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_22:01:32 -- Epoch: 0/0; Test; loss: 1.408; acc: 0.698; precision: 0.747, recall: 0.598, macrof1: 0.695, weightedf1: 0.695\u001b[0m\n",
      "--- 29.56263279914856 seconds ---\n",
      "--------- 32000\n",
      "\u001b[92m2020-09-10 22:01:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 22:01:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 22:01:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 22:01:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:01:33\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:02:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_22:02:01 -- Epoch: 0/0; Test; loss: 1.185; acc: 0.699; precision: 0.744, recall: 0.605, macrof1: 0.696, weightedf1: 0.696\u001b[0m\n",
      "--- 29.859045267105103 seconds ---\n",
      "--------- 64000\n",
      "\u001b[92m2020-09-10 22:02:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 22:02:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 22:02:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 22:02:02\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:02:03\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:02:31\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_22:02:31 -- Epoch: 0/0; Test; loss: 1.259; acc: 0.697; precision: 0.733, recall: 0.621, macrof1: 0.695, weightedf1: 0.695\u001b[0m\n",
      "--- 30.01087999343872 seconds ---\n",
      "--------- 84000\n",
      "\u001b[92m2020-09-10 22:02:31\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread input file: ./inputs/input_dfm_rnn_model_B.yaml\u001b[0m\n",
      "\u001b[92m2020-09-10 22:02:31\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mpytorch will use: cuda:1\u001b[0m\n",
      "\u001b[92m2020-09-10 22:02:31\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mread CSV file: ./dataset/wikigaz_en_test.txt\u001b[0m\n",
      "\u001b[92m2020-09-10 22:02:32\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;32mnumber of labels, True: 33469 and False: 33469\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "length s2:   0%|          | 0/66938 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:02:33\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[2;32mskipping 0 lines\u001b[0m\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "                                                                     \r"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "HBox(children=(FloatProgress(value=0.0, max=1046.0), HTML(value='')))"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[92m2020-09-10 22:03:01\u001b[0m \u001b[95mlwm-embeddings\u001b[0m \u001b[1m\u001b[90m[INFO]\u001b[0m \u001b[1;31m09/10/2020_22:03:01 -- Epoch: 0/0; Test; loss: 1.408; acc: 0.695; precision: 0.756, recall: 0.575, macrof1: 0.690, weightedf1: 0.690\u001b[0m\n",
      "--- 29.845705032348633 seconds ---\n"
     ]
    }
   ],
   "source": [
    "from DeezyMatch import inference as dm_inference\n",
    "\n",
    "for n_ft_examples in [250, 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 84000]:\n",
    "    print(\"---------\", n_ft_examples)\n",
    "    # Run inference on the WG:en test set with the model fine-tuned on n_ft_examples OCR instances;\n",
    "    # the model and vocabulary are loaded from pretrained_model_path and pretrained_vocab_path.\n",
    "    dm_inference(input_file_path=\"./inputs/input_dfm_rnn_model_B.yaml\",\n",
    "                 dataset_path=\"./dataset/wikigaz_en_test.txt\",\n",
    "                 pretrained_model_path=f\"./models/wikigaz_en_ft_ocr_rnn_v002_n{n_ft_examples}/wikigaz_en_ft_ocr_rnn_v002_n{n_ft_examples}.model\",\n",
    "                 pretrained_vocab_path=f\"./models/wikigaz_en_ft_ocr_rnn_v002_n{n_ft_examples}/wikigaz_en_ft_ocr_rnn_v002_n{n_ft_examples}.vocab\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
